
Record and use the tail size to prefetch table tail #11406

Closed · wants to merge 5 commits

Conversation


@hx235 hx235 commented Apr 24, 2023

Context:
We prefetch the tail part of an SST file (i.e., the blocks after the data blocks, through the end of the file) during each SST file open, in the hope of prefetching everything needed for later reads at once: footer, meta index, filter/index blocks, etc. The existing approach estimates the tail size to prefetch through the TailPrefetchStats heuristics introduced in #4156, which has caused unnecessarily small reads in unlucky cases (e.g., a small read into the tail buffer during table open in thread 1 under the same BlockBasedTableFactory object can make thread 2's tail prefetching use a size smaller than it should) and is hard to debug. Therefore we decided to record the exact tail size and use it directly to prefetch the tail of the SST, instead of relying on heuristics.
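To make the unlucky case concrete, below is a deliberately simplified sketch of a shared tail-size heuristic, loosely modeled on TailPrefetchStats; the class name, history size, and max-of-history logic are illustrative assumptions, not RocksDB's actual implementation. Because one stats object is shared across table opens under the same BlockBasedTableFactory, a small size recorded by one thread early on can shrink the suggestion another thread receives:

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical stand-in for TailPrefetchStats: shared by all table opens
// under one BlockBasedTableFactory, it records observed tail sizes and
// suggests a prefetch size based on recent history.
class SharedTailStats {
 public:
  void RecordEffectiveSize(size_t len) {
    std::lock_guard<std::mutex> lock(mu_);
    history_.push_back(len);
    if (history_.size() > kMaxHistory) {
      history_.pop_front();
    }
  }

  size_t GetSuggestedPrefetchSize() {
    std::lock_guard<std::mutex> lock(mu_);
    // Early on the history is short, so a single small size recorded by
    // thread 1 (e.g., from a tail read that came up short) dominates, and
    // thread 2's table open under-prefetches its tail.
    return history_.empty()
               ? 0
               : *std::max_element(history_.begin(), history_.end());
  }

 private:
  static constexpr size_t kMaxHistory = 32;
  std::mutex mu_;
  std::deque<size_t> history_;
};
```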

Summary:

  • Obtain the tail size and record it in the manifest in BlockBasedTableBuilder::Finish()
    • For backward compatibility, we fall back first to TailPrefetchStats and last to a simple heuristic that treats the tail size as a linear portion of the file size (see the PR conversation and the sketch after this list for more)
  • Make tail_start_offset part of the table properties and deduce from it the tail size to record in the manifest for external files (e.g., file ingestion, import CF) and db repair (which have no access to the manifest)
  • Use the same tail buffer for index/filter prefetching to reduce redundant file reads
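Putting the bullets together, the resolution order for the tail prefetch size looks roughly like the sketch below; the function and parameter names are illustrative, not the exact RocksDB internals, but the fallback chain and the last-resort constant come straight from this PR:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the tail-size resolution order this PR describes:
// (1) the exact tail size recorded in the manifest at
//     BlockBasedTableBuilder::Finish() time,
// (2) the legacy TailPrefetchStats suggestion, for files written by
//     older versions,
// (3) a last-resort heuristic: 4 KB plus 1% of the file size.
size_t ResolveTailPrefetchSize(uint64_t recorded_tail_size,  // 0 if absent
                               size_t stats_suggestion,      // 0 if absent
                               uint64_t file_size) {
  if (recorded_tail_size > 0) {
    return static_cast<size_t>(recorded_tail_size);
  }
  if (stats_suggestion > 0) {
    return stats_suggestion;
  }
  return static_cast<size_t>(4 * 1024 + 0.01 * file_size);
}
```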

Test:

  1. New UT
  2. db bench
    Note: db bench on /tmp/ (where direct reads are supported) is too slow to finish, and the default pinning setting in db bench is not helpful for profiling the number of SST reads per Get. Therefore I hacked the following to obtain the comparison below.
```diff
diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc
index bd5669f0f..791484c1f 100644
--- a/table/block_based/block_based_table_reader.cc
+++ b/table/block_based/block_based_table_reader.cc
@@ -838,7 +838,7 @@ Status BlockBasedTable::PrefetchTail(
                            &tail_prefetch_size);
 
   // Try file system prefetch
-  if (!file->use_direct_io() && !force_direct_prefetch) {
+  if (false && !file->use_direct_io() && !force_direct_prefetch) {
     if (!file->Prefetch(prefetch_off, prefetch_len, ro.rate_limiter_priority)
              .IsNotSupported()) {
       prefetch_buffer->reset(new FilePrefetchBuffer(
diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc
index ea40f5fa0..39a0ac385 100644
--- a/tools/db_bench_tool.cc
+++ b/tools/db_bench_tool.cc
@@ -4191,6 +4191,8 @@ class Benchmark {
           std::shared_ptr<TableFactory>(NewCuckooTableFactory(table_options));
     } else {
       BlockBasedTableOptions block_based_options;
+      block_based_options.metadata_cache_options.partition_pinning =
+          PinningTier::kAll;
       block_based_options.checksum =
           static_cast<ChecksumType>(FLAGS_checksum_type);
       if (FLAGS_use_hash_search) {
```

Create DB

```
./db_bench -db=/dev/shm/testdb/ --partition_index_and_filters=1 --statistics=1 -benchmarks="fillseq" -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=true -compression_type=none
```

ReadRandom

```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```

(a) Existing (use TailPrefetchStats for tail size + use a separate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies())

```
rocksdb.table.open.prefetch.tail.hit COUNT : 3395
rocksdb.sst.read.micros P50 : 5.655570 P95 : 9.931396 P99 : 14.845454 P100 : 585.000000 COUNT : 999905 SUM : 6590614
```

(b) This PR (record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies())

```
rocksdb.table.open.prefetch.tail.hit COUNT : 14257
rocksdb.sst.read.micros P50 : 5.173347 P95 : 9.015017 P99 : 12.912610 P100 : 228.000000 COUNT : 998547 SUM : 5976540
```

As we can see, this PR increases the prefetch tail hit count and decreases the SST read count and latency.

  3. Tested backward compatibility by stepping through reads with post-PR code on a db generated pre-PR.

@hx235 hx235 added the WIP (Work in progress) label on Apr 24, 2023
@hx235 hx235 force-pushed the table_prefetch_tail branch 9 times, most recently from ca7ff4e to 55d1966 on April 28, 2023 21:09
@hx235 hx235 changed the title from "Record and use the table offset where data blocks ends to prefetch table tail" to "Record and use the tail start offset where data blocks ends to prefetch table tail" on Apr 28, 2023
@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the table_prefetch_tail branch from 55d1966 to 0ff26df on April 28, 2023 22:15
@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

```cpp
tail_prefetch_size);
} else {
  tail_prefetch_size = prefetch_all || preload_all
      ? static_cast<size_t>(4 * 1024 + 0.01 * file_size)
```
hx235 (Contributor Author):

@anand1976 @ajkr I welcome any better heuristics for the case where we don't have tail_prefetch_size, e.g., backward compatibility. I used the current one simply because someone provided it in the task about this.

Reviewer (Contributor):

I worry a little bit about a smooth upgrade. When someone upgrades to this version, they'll initially get this crude heuristic rather than the old adaptive TailPrefetchStats. How hard would it be to leave TailPrefetchStats for a release or two as a fallback when the new data is not available?

hx235 (Contributor Author):

Good point - it's not difficult to leave TailPrefetchStats as a fallback for a couple more releases, and I will fix that.

hx235 (Contributor Author), May 2, 2023:

Fixed by falling back first to TailPrefetchStats and last to the new heuristic `static_cast<size_t>(4 * 1024 + 0.01 * file_size)`.


hx235 commented Apr 28, 2023

@ajkr pretty much ready for review, except it's currently going through a stress test just in case. Will let you know when the stress test is done.

@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

```cpp
tail_prefetch_size = prefetch_all || preload_all ? 512 * 1024 : 4 * 1024;
if (tail_start_offset != 0) {
  tail_prefetch_size = file_size - tail_start_offset;
} else if (contain_no_data_block) {
```
hx235 (Contributor Author), Apr 29, 2023:

[Note] This is for the case where the file contains only range deletion entries, so there is nothing in the data blocks; here "0" is not a fallback but an informed value.
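For reference, here is a minimal sketch of how such a data-block-free file can arise through the public API: flush a memtable holding only a range tombstone. The DB path and options are placeholders.

```cpp
#include <cassert>

#include "rocksdb/db.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  assert(rocksdb::DB::Open(options, "/tmp/range_del_only_db", &db).ok());

  // Write only a range deletion, no point keys.
  assert(db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                         "a", "z")
             .ok());

  // The flushed SST contains no data blocks, so its recorded tail start
  // offset of 0 is an informed value, not a fallback.
  assert(db->Flush(rocksdb::FlushOptions()).ok());
  delete db;
  return 0;
}
```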

@hx235 hx235 requested a review from ajkr May 1, 2023 19:21
@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the table_prefetch_tail branch from 7ec4e4b to 2b5c0e2 on May 2, 2023 19:41
@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pdillinger pdillinger left a comment:

Generally LGTM, but some suggestions

HISTORY.md (outdated)

```
@@ -3,6 +3,9 @@
### New Features
* Introduced a new option `block_protection_bytes_per_key`, which can be used to enable per key-value integrity protection for in-memory blocks in block cache (#11287).

### Performance Improvements
* Record the starting offset of block-based table's tail (i.e., all blocks after data blocks till the end) in manifest and use it to prefetch the tail more accurately during table open instead of relying on heuristics (#11406). Heuristics will still be used for now as a fallback for backward compatibility.
```
pdillinger (Contributor):

"for now"

Perhaps say on files written by previous versions.

hx235 (Contributor Author):

Fixed

```cpp
Status s = TryReopen(options);
if (use_direct_io && (s.IsNotSupported() || s.IsInvalidArgument())) {
  // If direct IO is not supported, skip the test
  return;
```
pdillinger (Contributor):

A little better to use ROCKSDB_GTEST_BYPASS so that at least when running the test manually you can see if it unexpectedly bypassed the full test.

hx235 (Contributor Author):

Fixed


```cpp
ASSERT_OK(Put("k1", "v1"));

HistogramData pre_flush_file_read;
```
pdillinger (Contributor):

I find it a bit simpler to Reset() the statistics at the start of what you're interested in, but this is also OK
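For illustration, the Reset() approach would look roughly like this inside the test (a sketch: the histogram chosen and the surrounding DBTestBase-style helpers like Put/Flush/Reopen are assumptions, not the test's actual code):

```cpp
// Zero the counters right before the region of interest instead of
// diffing two HistogramData snapshots taken before and after.
options.statistics = rocksdb::CreateDBStatistics();
Reopen(options);
// ... setup writes the test does not want to measure ...
ASSERT_OK(options.statistics->Reset());  // start measuring here
ASSERT_OK(Put("k1", "v1"));
ASSERT_OK(Flush());
rocksdb::HistogramData file_read;
options.statistics->histogramData(rocksdb::SST_READ_MICROS, &file_read);
```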

```cpp
tail_prefetch_size = file_size - tail_start_offset;
} else if (contain_no_data_block) {
  tail_prefetch_size = file_size - tail_start_offset;
  ROCKS_LOG_WARN(logger,
```
pdillinger (Contributor):

I don't think we need a warning here. A code comment would be fine, IMHO.

hx235 (Contributor Author):

Fixed

```cpp
tail_prefetch_size = prefetch_all || preload_all ? 512 * 1024 : 4 * 1024;
if (tail_start_offset != 0) {
  tail_prefetch_size = file_size - tail_start_offset;
} else if (contain_no_data_block) {
```
pdillinger (Contributor):

Might be an annoying change at this point, but recording the tail size rather than the offset would not have the same issue with 0 meaning either "unset" or "starting at offset 0." There's also a tiny manifest size advantage, as the tail size tends to be much smaller than the tail offset (when encoded with a variable-length encoding). Getting rid of tracking contain_no_data_block would probably be nice.
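A quick standalone illustration of the manifest-size point (not RocksDB code; it just reimplements the varint length calculation that RocksDB-style encoding uses, with example file and tail sizes as assumptions):

```cpp
#include <cstdint>
#include <cstdio>

// Bytes needed for a LevelDB/RocksDB-style varint: 7 payload bits per byte.
static int VarintLength(uint64_t v) {
  int len = 1;
  while (v >= 128) {
    v >>= 7;
    ++len;
  }
  return len;
}

int main() {
  const uint64_t file_size = 64ull << 20;   // e.g., a 64 MB SST
  const uint64_t tail_size = 512ull << 10;  // with a 512 KB tail
  const uint64_t tail_offset = file_size - tail_size;
  // Prints 4 bytes for the offset vs. 3 for the size: storing the smaller
  // number saves a byte or two per file in the manifest.
  std::printf("varint(tail_offset) = %d bytes\n", VarintLength(tail_offset));
  std::printf("varint(tail_size)   = %d bytes\n", VarintLength(tail_size));
  return 0;
}
```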

hx235 (Contributor Author):

Will look into this.

hx235 (Contributor Author):

Fixed

```cpp
if (!s.ok()) {
  return s;
if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled()) {
  rep->CreateFilePrefetchBuffer(
```
pdillinger (Contributor):

Is this code still covered by unit tests? (I sometimes throw in an assert(false); to test that.)
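The coverage trick mentioned above, sketched against the quoted code (a temporary debugging aid, not something to commit):

```cpp
// Temporarily plant an abort in the branch in question and re-run the test
// suite: any test that exercises the branch fails loudly, proving coverage.
if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled()) {
  assert(false);  // TEMPORARY: verify unit tests reach this fallback path
  rep->CreateFilePrefetchBuffer(/* ... original arguments ... */);
}
```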

hx235 (Contributor Author), May 5, 2023:

Yes - a couple of tests do, including my TEST_P(PrefetchTest, BlockBasedTableTailPrefetch).

```cpp
@@ -239,6 +240,10 @@ struct TableProperties {
  // 0 means not exists.
  uint64_t external_sst_file_global_seqno_offset = 0;

  // Offset where the "tail" part of SST file starts
  // "Tail" refers to all blocks after data blocks till the end of the SST file
  uint64_t tail_start_offset = 0;
```
pdillinger (Contributor):

The motivation for adding this appears to be for SST ingest/import, which is reasonable.

hx235 (Contributor Author):

Yes, mostly for SST ingest/import/repair.

HISTORY.md (outdated), same hunk as quoted above
pdillinger (Contributor):

In general, I prefer the release notes not to talk too much about low level details but instead the user experience. For example, "Improved the I/O efficiency of prefetching SST metadata by recording more information in the DB manifest. Opening files written with previous versions will still rely on heuristics for how much to prefetch."

hx235 (Contributor Author):

Fixed

@hx235 hx235 changed the title from "Record and use the tail start offset where data blocks ends to prefetch table tail" to "Record and use the tail size where data blocks ends to prefetch table tail" on May 4, 2023
@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 changed the title from "Record and use the tail size where data blocks ends to prefetch table tail" to "Record and use the tail size to prefetch table tail" on May 5, 2023
@hx235 hx235 force-pushed the table_prefetch_tail branch from 2c4eac5 to 5d05423 on May 5, 2023 02:05
@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pdillinger pdillinger left a comment:

I failed to realize the complication of being unable to determine the tail size at the time of writing it out as a table property. We're discussing possible remedies offline.

@hx235 hx235 force-pushed the table_prefetch_tail branch from 5d05423 to 8bd7892 on May 5, 2023 22:29
@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


hx235 commented May 5, 2023

I failed to realize the complication of being unable to determine the tail size at the time of writing it out as a table property. We're discussing possible remedies offline.

Fixed by preserving tail_start_offset in the table properties so ingest/import/repair can deduce tail_size from it.
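In sketch form, the deduction this enables for paths that cannot consult the manifest (the function name is illustrative; the arithmetic follows the PR's definition of the tail):

```cpp
#include <cstdint>

// With tail_start_offset preserved in the table properties, ingestion,
// import CF, and repair can recover the exact tail size from the file:
// the tail spans from the end of the data blocks to the end of the file.
uint64_t DeduceTailSize(uint64_t file_size, uint64_t tail_start_offset) {
  return file_size - tail_start_offset;
}
```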

@facebook-github-bot: @hx235 merged this pull request in 8f763bd.

facebook-github-bot pushed a commit that referenced this pull request Jun 7, 2023
Summary:
**Context:**
[PR11406](#11406) caused more frequent reads during db open when reading files with no `tail_size` in the manifest, as part of the upgrade to 11406. This is because that PR introduced
- a [smaller](https://github.com/facebook/rocksdb/pull/11406/files#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fR833) prefetch tail buffer size compared to pre-11406 for small files (< 52 MB) when `tail_prefetch_stats` infers the tail size to be 0 (which usually happens early on, when the stats do not yet have much historical data to infer from)
- more reads (up to the number of partitioned filter/index partitions) when such a small prefetch tail buffer does not contain all the partitioned filter/index blocks needed in CacheDependencies(), since the [fallback logic](https://github.com/facebook/rocksdb/pull/11406/files#diff-d98f1a83de24412ad7f3527725dae7e28851c7222622c3cdb832d3cdf24bbf9fR165-R179) that prefetches all partitions at once is skipped when such a small prefetch tail buffer is passed in

**Summary:**
- Revert the fallback prefetch buffer size change to fully preserve existing behavior during upgrade in `BlockBasedTable::PrefetchTail()`
- Use the passed-in prefetch tail buffer in `CacheDependencies()` only if it has a smaller offset than the offset of the first partitioned filter/index, that is, only when it is at least as good as the existing prefetching behavior

Pull Request resolved: #11516

Test Plan:
- db bench

Create db with small files prior to PR 11406
```
./db_bench -db=/tmp/testdb/ --partition_index_and_filters=1 --statistics=1 -benchmarks=fillseq -key_size=3200 -value_size=5 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=true -compression_type=zstd
```
Read the db to see if post-PR gives fewer reads (i.e., a lower rocksdb.file.read.db.open.micros count) during db open.
```
./db_bench -use_direct_reads=1 --file_opening_threads=1 --threads=1 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 --db=/tmp/testdb/ --benchmarks=readrandom --key_size=3200 --value_size=5 --num=100 --disable_auto_compactions=true --compression_type=zstd
```
Pre-PR:
```
rocksdb.file.read.db.open.micros P50 : 3.399023 P95 : 5.924468 P99 : 12.408333 P100 : 29.000000 COUNT : 611 SUM : 2539
```

Post-PR:
```
rocksdb.file.read.db.open.micros P50 : 593.736842 P95 : 861.605263 P99 : 1212.868421 P100 : 2663.000000 COUNT : 585 SUM : 345349
```

_Note: To control the starting offset of the prefetch tail buffer more easily, I manually overrode the following to eliminate the effect of alignment:_
```
class PosixRandomAccessFile : public FSRandomAccessFile {
virtual size_t GetRequiredBufferAlignment() const override {
-    return logical_sector_size_;
+    return 1;
  }
 ```

- CI

Reviewed By: pdillinger

Differential Revision: D46472566

Pulled By: hx235

fbshipit-source-id: 2fe14ac8d489d15b0e08e6f8fe4f46d5f110978e
hx235 added three commits that referenced this pull request on Jun 7, 2023, each with the same message as above.
facebook-github-bot pushed a commit that referenced this pull request Jun 16, 2023
Summary:
**Context/Summary:**
When a db is upgrading to adopt [pr11406](#11406), it's possible for RocksDB to infer a small tail size to prefetch for pre-upgrade files. Such a small tail size would have caused one file read per index or filter partition if a partitioned index or filter is used. This PR provides a UT to show this does not happen.

Misc: refactored the related UTs a bit to make this new UT more readable.

Pull Request resolved: #11522

Test Plan:
- New UT
If the upgrade logic were wrong, e.g.,
```
 --- a/table/block_based/partitioned_index_reader.cc
+++ b/table/block_based/partitioned_index_reader.cc
@@ -166,7 +166,8 @@ Status PartitionIndexReader::CacheDependencies(
   uint64_t prefetch_len = last_off - prefetch_off;
   std::unique_ptr<FilePrefetchBuffer> prefetch_buffer;
   if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled() ||
-      tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off) {
+      (false && tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off)) {
```
, then the UT will fail like below
```
[ RUN      ] PrefetchTailTest/PrefetchTailTest.UpgradeToTailSizeInManifest/0
file/prefetch_test.cc:461: Failure
Expected: (db_open_file_read.count) < (num_index_partition), actual: 38 vs 33
Received signal 11 (Segmentation fault)
```

Reviewed By: pdillinger

Differential Revision: D46546707

Pulled By: hx235

fbshipit-source-id: 9897b0a975e9055963edac5451fd1cd9d6c45d0e
dongjoon-hyun pushed a commit to apache/spark that referenced this pull request Jul 1, 2023
### What changes were proposed in this pull request?
This PR aims to upgrade rocksdbjni from 8.1.1.1 to 8.3.2.

### Why are the changes needed?
The release notes: https://github.com/facebook/rocksdb/releases/tag/v8.3.2
- Bug Fixes
Reduced cases of illegally using Env::Default() during static destruction by never destroying the internal PosixEnv itself (except for builds checking for memory leaks). (facebook/rocksdb#11538)

- Performance Improvements
Fixed higher read QPS during DB::Open() reading files created prior to facebook/rocksdb#11406, especially when reading many small files (size < 52 MB) during DB::Open() and partitioned filter or index is used.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.
- Manual test: `org.apache.spark.util.kvstore.RocksDBBenchmark`

**A.8.1.1.1**
```
                                        	count	mean	min	max	95th
dbClose                                 	4	0.472	0.413	0.527	0.527
dbCreation                              	4	263.388	42.032	922.916	922.916
naturalIndexCreateIterator              	1024	0.005	0.001	2.646	0.004
naturalIndexDescendingCreateIterator    	1024	0.002	0.002	0.055	0.002
naturalIndexDescendingIteration         	1024	0.003	0.003	0.021	0.004
naturalIndexIteration                   	1024	0.009	0.003	3.156	0.012
randomDeleteIndexed                     	1024	0.017	0.013	0.381	0.023
randomDeletesNoIndex                    	1024	0.010	0.009	0.032	0.011
randomUpdatesIndexed                    	1024	0.066	0.025	19.900	0.074
randomUpdatesNoIndex                    	1024	0.017	0.015	0.380	0.019
randomWritesIndexed                     	1024	0.097	0.024	52.970	0.093
randomWritesNoIndex                     	1024	0.019	0.015	1.101	0.021
refIndexCreateIterator                  	1024	0.002	0.002	0.044	0.002
refIndexDescendingCreateIterator        	1024	0.001	0.001	0.013	0.001
refIndexDescendingIteration             	1024	0.004	0.003	0.070	0.005
refIndexIteration                       	1024	0.005	0.003	0.230	0.005
sequentialDeleteIndexed                 	1024	0.016	0.013	0.104	0.022
sequentialDeleteNoIndex                 	1024	0.011	0.009	0.044	0.011
sequentialUpdatesIndexed                	1024	0.027	0.019	0.660	0.050
sequentialUpdatesNoIndex                	1024	0.025	0.016	0.523	0.033
sequentialWritesIndexed                 	1024	0.030	0.020	1.526	0.040
sequentialWritesNoIndex                 	1024	0.030	0.017	4.410	0.035
```
**B.8.3.2**
```
                                        	count	mean	min	max	95th
dbClose                                 	4	0.488	0.424	0.556	0.556
dbCreation                              	4	241.375	35.710	850.488	850.488
naturalIndexCreateIterator              	1024	0.004	0.001	1.555	0.006
naturalIndexDescendingCreateIterator    	1024	0.002	0.002	0.064	0.002
naturalIndexDescendingIteration         	1024	0.004	0.003	0.035	0.004
naturalIndexIteration                   	1024	0.011	0.003	4.464	0.012
randomDeleteIndexed                     	1024	0.018	0.013	0.505	0.024
randomDeletesNoIndex                    	1024	0.010	0.009	0.025	0.011
randomUpdatesIndexed                    	1024	0.065	0.024	20.210	0.077
randomUpdatesNoIndex                    	1024	0.019	0.015	0.449	0.027
randomWritesIndexed                     	1024	0.087	0.026	38.782	0.096
randomWritesNoIndex                     	1024	0.019	0.016	1.040	0.019
refIndexCreateIterator                  	1024	0.002	0.002	0.051	0.002
refIndexDescendingCreateIterator        	1024	0.001	0.001	0.013	0.001
refIndexDescendingIteration             	1024	0.004	0.003	0.064	0.004
refIndexIteration                       	1024	0.005	0.003	0.253	0.006
sequentialDeleteIndexed                 	1024	0.016	0.013	0.286	0.024
sequentialDeleteNoIndex                 	1024	0.010	0.009	0.071	0.012
sequentialUpdatesIndexed                	1024	0.026	0.020	0.711	0.042
sequentialUpdatesNoIndex                	1024	0.024	0.016	0.455	0.030
sequentialWritesIndexed                 	1024	0.025	0.020	1.359	0.033
sequentialWritesNoIndex                 	1024	0.028	0.017	4.354	0.030
```

    B. Checked core module UTs with RocksDB live UI:
    ```
    export LIVE_UI_LOCAL_STORE_DIR=/${basedir}/spark-ui
    build/mvn clean install -pl core -am -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest -fn
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time:  39:09 min
    [INFO] Finished at: 2023-06-30T09:48:44+08:00
    [INFO] ------------------------------------------------------------------------
    ```

Closes #41802 from panbingkun/SPARK-44256.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@v01dstar v01dstar mentioned this pull request Oct 22, 2024